refactor: rai_bench#517

Merged
maciejmajek merged 45 commits into development from jm/refactor/rai_bench
Apr 23, 2025

Conversation


@jmatejcz (Contributor) commented Apr 9, 2025

Purpose

Unify the folder structure and naming of benchmarks.
Refactor the tool_calling benchmark, as there are conflicts between these branches:
https://github.com/RobotecAI/rai/tree/mk/feat/spatial-reasoning-tasks
https://github.com/RobotecAI/rai/tree/jm/feat/tool-benchmark-custom-interfaces
Both changed the tool calling benchmark and were done in a hurry, which resulted in a lot of conflicts and not the best code.

Merge and unify the tasks that are already on development with the spatial, navigation, and custom interfaces tasks.
Improve and unify logging and saving of results.

This PR is big, so please follow the change descriptions below. Focus first on points 1 and 2: the new frame is essential to this PR, as it dictates how validation is executed, and many of the other changes are adjusted to it.

Related PRs and branches will also be closed when this PR is merged.

⚠️ Warning: This PR does not provide a full benchmark that is ready to test models; only sample tasks were created. The action mocks/tests are also missing, as I decided to handle them in separate PRs.

Proposed Changes

  1. Restructured package:
  • All naming unified; there are now two benchmarks: manipulation_o3de and tool_calling_agent. Every folder related to a benchmark has exactly that name.

  • All code providing the framework for creating benchmarks lives in the corresponding benchmark folder.

  • All code related to a specific benchmark implementation lives in the examples/ folder.

  • Experiment logs and results live in the experiments/ folder.

  • Files/folders responsible for interfaces, tasks, benchmarks, etc. are named the same across benchmarks.

           ├── rai_bench
           │   ├── examples/
           │   ├── experiments/
           │   ├── tool_calling_agent/
           │   ├── manipulation_o3de/

THE REST OF THE POINTS CONCERN THE TOOL CALLING AGENT BENCHMARK:
2. New frame for the tool_calling_agent benchmark:

  • SubTask - the smallest block, responsible for validating a single tool call (e.g. ListTopics)
  • Validator - consists of subtasks; based on the validator type, it checks whether all subtasks were completed in a certain way
  • Task - consists of validators. Every Validator can be treated as a single step that is scored atomically; visit examples/tool_calling_agent/tasks.py for more intuition. Every Task always has the same prompt and available tools; only the validation methods can be parametrized. On top of validators, you can pass the extra_tool_calls param to allow the model to correct itself.
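A minimal sketch of this SubTask/Validator/Task frame, to make the relation concrete. All class names and method signatures here are illustrative assumptions, not the actual rai_bench API; the real validators and scoring live in the package itself.

```python
# Hypothetical sketch of the SubTask / Validator / Task frame; names and
# signatures are illustrative only, not the real rai_bench interfaces.
from typing import Any, Callable, Dict, List, Tuple

class SubTask:
    """Smallest block: validates a single tool call."""
    def __init__(self, expected_tool: str,
                 check: Callable[[Dict[str, Any]], bool] = lambda args: True):
        self.expected_tool = expected_tool
        self.check = check

    def validate(self, tool_call: Dict[str, Any]) -> bool:
        return tool_call["name"] == self.expected_tool and self.check(tool_call["args"])

class OrderedValidator:
    """Passes when its subtasks are matched, in order, by the tool calls."""
    def __init__(self, subtasks: List[SubTask]):
        self.subtasks = subtasks

    def validate(self, tool_calls: List[Dict[str, Any]]) -> Tuple[bool, List[Dict[str, Any]]]:
        idx = 0
        for i, call in enumerate(tool_calls):
            if self.subtasks[idx].validate(call):
                idx += 1
                if idx == len(self.subtasks):
                    # Success: hand the remaining tool calls to the next validator.
                    return True, tool_calls[i + 1:]
        return False, []

class Task:
    """Consists of validators; each validator is one step, scored atomically."""
    def __init__(self, validators: List[OrderedValidator]):
        self.validators = validators

    def score(self, tool_calls: List[Dict[str, Any]]) -> float:
        passed = 0
        remaining = tool_calls
        for validator in self.validators:
            ok, remaining = validator.validate(remaining)
            passed += ok
        return passed / len(self.validators)
```

Under this sketch, strictness is tuned by which subtasks each validator holds and by how many extra tool calls the task tolerates before validation, mirroring the extra_tool_calls param described above.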

[diagram: refactor_bench framework]

  3. Migrated, refactored, and unified the tool_calling_agent benchmark tasks.

  4. Mocks:
  • mocks of tools (mostly imported from the branches mentioned above)
  • refactored action mocks, but more work is needed here - it will be continued in this issue -> Action mocks #526

  5. Models:
  • Pydantic models that reflect messages from ROS 2, which enables validation

  6. Unit tests for subtasks and validators.

  7. New GetInterfaceTool - the old version didn't return the types of fields.

  8. Results and logs:
  • (unchanged) all logs are written to benchmark.log files
  • the result file is intended to be the source of info for further processing.
    Structure of results:
    [image: structure of results]
    Results now have a list of validators showing what each validator expects,
    followed by a passed list, which holds a bool for every validator,
    followed by score (which is redundant with passed, but it would be odd not to have a score in the results; if you have ideas here, please share),
    followed by errors, a list of lists where every validator has its own list of errors.

  9. Args passed when running the tool calling agent benchmark: the user can pass 2 args - model_name and vendor.

  10. Small docs added, as I wanted to paste an image. The docs are small for now, but I guess there will be docs for benchmarks in the future anyway.

  11. Script to test multiple models, different benchmarks, or several repeats in one go: https://github.com/RobotecAI/rai/blob/jm/refactor/rai_bench/src/rai_bench/rai_bench/examples/test_models.py
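The result-record shape described in the results point above can be illustrated as follows. The field values and the task prompt are made up for illustration, and the exact keys in rai_bench may differ; only the validators/passed/score/errors relationship comes from the description.

```python
# Illustrative shape of one result record, based on the fields described above
# (validators, passed, score, errors); exact key names in rai_bench may differ.
result = {
    "task_prompt": "List the available ROS 2 topics.",  # hypothetical task
    "validators": ["ordered: ListTopics"],              # what each validator expects
    "passed": [True],                                   # one bool per validator
    "score": 1.0,                                       # fraction of validators passed
    "errors": [[]],                                     # one error list per validator
}
```

The invariant worth keeping in further processing: validators, passed, and errors are parallel lists, and score is derivable from passed.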

Issues

#515
Related PRs and branches:
#493
#487
mk/feat/tool-calling-bench-navigation-tasks

Testing

Test single

python src/rai_bench/rai_bench/examples/tool_calling_agent/main.py --model-name llama3.2 --vendor ollama

tests:

pytest tests/rai_bench/

script running benchmarks:

python src/rai_bench/rai_bench/examples/test_models.py 

Next steps


jmatejcz commented Apr 9, 2025

@MagdalenaKotynia @maciejmajek please take a look and see if you like this reconstruction of the package structure and the new frame. If yes, I will proceed with applying this refactor to the other tasks and to the changes from the 2 conflicted branches.

The refactor is not completed yet; I have only applied the new frame to 2 tasks as an example, so don't pay attention to the untouched parts of the code.

@jmatejcz jmatejcz force-pushed the jm/refactor/rai_bench branch 2 times, most recently from 2e189d9 to 38e64ef Compare April 9, 2025 14:54
@jmatejcz (Contributor Author)

How should we log errors when extra calls are passed? For example, if there are errors in 3 calls but the agent did it correctly in the 4th, should we log the previous 3 even if the validator eventually passed?

trace_id=str(run_id),
name="tool calls result",
value=float(success),
value=float(score),

Score is already float, so you don't need to convert it to float.


fixed here: ef8ae2f

run_id=run_id,
key="tool calls result",
score=float(success),
score=float(score),

fixed here: ef8ae2f


done_properly = 0
for validator in self.validators:
if_success, remaining_tool_calls = validator.validate(

What about is_success?


you mean naming?


yes

for arg_name, arg_value in expected_args.items():
if arg_name in tool_call["args"]:
if tool_call["args"][arg_name] != arg_value:
SubTaskValidationError(

Suggested change
SubTaskValidationError(
raise SubTaskValidationError(
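For context on why this suggestion (and the identical ones below) matters: in Python, constructing an exception without `raise` is a silent no-op, so the original code would create the error object, discard it, and continue validating as if nothing went wrong. A minimal self-contained demonstration, with hypothetical check functions:

```python
# Demonstrates the bug the "raise SubTaskValidationError(...)" suggestions fix:
# constructing an exception without raising it is silently discarded.
class SubTaskValidationError(Exception):
    pass

def check_without_raise(value: int) -> str:
    if value < 0:
        SubTaskValidationError("negative value")  # no-op: object created, then discarded
    return "passed"

def check_with_raise(value: int) -> str:
    if value < 0:
        raise SubTaskValidationError("negative value")  # actually interrupts validation
    return "passed"
```

With the first variant, invalid input still reports "passed"; with the second, the caller gets the error and can record it per validator.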


applied here: ef8ae2f

f"Expected argument '{arg_name}' should have value '{arg_value}', but got '{tool_call['args'][arg_name]}'"
)
else:
SubTaskValidationError(

Suggested change
SubTaskValidationError(
raise SubTaskValidationError(


applied here: ef8ae2f

if arg_name not in expected_args:
# If this argument is not required, check if it's an allowed optional argument
if not expected_optional_args or arg_name not in expected_optional_args:
SubTaskValidationError(

Suggested change
SubTaskValidationError(
raise SubTaskValidationError(


applied here: ef8ae2f

# If optional argument has expected value, check if the value is correct
elif expected_optional_args[arg_name]:
if expected_optional_args[arg_name] != arg_value:
SubTaskValidationError(

Suggested change
SubTaskValidationError(
raise SubTaskValidationError(


applied here: ef8ae2f

except StopIteration:
return True, tool_calls[i:]

self.log_error(msg="Failed to validate")

I suggest logging a less generic message, with more info about which tool failed to validate.


@MagdalenaKotynia MagdalenaKotynia left a comment


The proposed new structure overall looks good to me.
Please remember about handling strict/non-strict validation. Currently, both implemented validators seem to be strict and do not allow the agent to self-correct. Please also remember to handle tool call args that may have an allowed value in some range, and situations where the value does not matter.
FYI, I didn't do a code review; I just reviewed the structure, but I left some comments regarding the code when I noticed something along the way.


jmatejcz commented Apr 11, 2025

> The proposed new structure overall looks good to me. Please remember about handling strict/not strict validation. Currently, both implemented validators seem to be strict and not allow agent to self-correct. Please also remember to handle tool call args that may have allowed value in some range and situations when value does not matter. FYI I didn't do code review, I just reviewed the structure, but I left some comments regarding the code when I noticed something by the way.

I think you misunderstood the validator-task relation. You can make validation strict or less strict by adjusting the subtasks passed to validators and the extra_tool_calls param; please look at rai_bench/examples/tool_calling_agent/tasks.py.

Also look at the picture of the framework:
#515

@jmatejcz jmatejcz force-pushed the jm/refactor/rai_bench branch 5 times, most recently from b1b6889 to 89a3a1e Compare April 16, 2025 10:09
@jmatejcz jmatejcz marked this pull request as ready for review April 16, 2025 10:23

@maciejmajek (Member) commented Apr 16, 2025


Answering your question, please rename the file to interface_parser.py and move it to the generic folder.


applied here: 678047a

@jmatejcz
Contributor Author

moved docs to rai_bench: bdc58e6

@jmatejcz
Contributor Author

saving errors from extra tool calls: 010fc5d

@jmatejcz
Contributor Author

removed default empty dicts from subtasks and improved error messages in validators:
710864f

@jmatejcz
Contributor Author

@maciejmajek we spoke about setting the recursion limit to the number of tool calls required in the task, but 1 step is just one node execution, not a tool call, so restricting it like this does not make much sense (tool calls are in AIMessages).
[image: agent execution steps]

I've set the recursion limit to 4*(required_tool_calls+extra_tool_calls) and changed agent.invoke to stream, which lets us collect messages from the agent even when the recursion limit occurs: 0cddef4
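The invoke-vs-stream trade-off described here can be shown with a dependency-free sketch. This is not the actual LangGraph/rai_bench code: the GraphRecursionError class and fake_agent_stream generator are stand-ins that simulate an agent hitting its recursion limit partway through.

```python
# Sketch of why streaming helps: a plain invoke() that hits the recursion
# limit loses all output, while consuming a stream keeps the messages
# collected so far. fake_agent_stream stands in for the real agent.
from typing import Dict, Iterator, List

class GraphRecursionError(Exception):
    """Stand-in for the error raised when the recursion limit is exceeded."""

def fake_agent_stream(limit: int) -> Iterator[Dict[str, str]]:
    """Yields messages until the (simulated) recursion limit is hit."""
    for step in range(10):
        if step >= limit:
            raise GraphRecursionError("recursion limit reached")
        yield {"step": str(step), "content": f"message {step}"}

def collect_messages(limit: int) -> List[Dict[str, str]]:
    messages: List[Dict[str, str]] = []
    try:
        for msg in fake_agent_stream(limit):
            messages.append(msg)
    except GraphRecursionError:
        pass  # keep whatever was collected before the limit was hit
    return messages
```

With invoke-style consumption, the exception would propagate before any result is returned; the streaming loop above retains the partial transcript, which is what the validators need to score the run.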

@jmatejcz jmatejcz force-pushed the jm/refactor/rai_bench branch from ce0979b to 103d22f Compare April 17, 2025 10:50

jmatejcz commented Apr 17, 2025

Added a python script that lets the user run multiple models on different benchmarks in one go:
ce0979b
Also modified model initialization to pass only the model name and vendor.

Modified the instructions in the README: e6953e3
and in the PR description on how to use it.

@jmatejcz jmatejcz force-pushed the jm/refactor/rai_bench branch 2 times, most recently from e6953e3 to 82723c7 Compare April 17, 2025 11:43
@jmatejcz jmatejcz requested a review from maciejmajek April 17, 2025 11:48
jmatejcz and others added 26 commits April 22, 2025 09:40
Co-authored-by: Magdalena Kotynia <magdalena.kotynia@robotec.ai>
Co-authored-by: Magdalena Kotynia <magdalena.kotynia@robotec.ai>
fixes to subtasks
import fix to old tests
deleted unused code
adjusted error messages in validators
gathering tool calls even when recursion limit occurs
refactor running benchmarks to take args
added model initialization via model name
@jmatejcz jmatejcz force-pushed the jm/refactor/rai_bench branch from c80ed2e to 3fb30ca Compare April 22, 2025 07:40
Comment on lines +163 to +197
```python
def get_llm_model_direct(
    model_name: str,
    vendor: str,
    config_path: Optional[str] = None,
    **kwargs: Any,
) -> ChatOpenAI | ChatBedrock | ChatOllama:
    config = load_config(config_path)
    model_config = getattr(config, vendor)

    logger.info(f"Initializing Model: {model_name}, Vendor: {vendor}")
    if vendor == "openai":
        from langchain_openai import ChatOpenAI

        model_config = cast(OpenAIConfig, model_config)

        return ChatOpenAI(model=model_name, base_url=model_config.base_url, **kwargs)
    elif vendor == "aws":
        from langchain_aws import ChatBedrock

        model_config = cast(AWSConfig, model_config)

        return ChatBedrock(
            model_id=model_name,
            region_name=model_config.region_name,
            **kwargs,
        )
    elif vendor == "ollama":
        from langchain_ollama import ChatOllama

        model_config = cast(OllamaConfig, model_config)
        return ChatOllama(model=model_name, base_url=model_config.base_url, **kwargs)
    else:
        raise ValueError(f"Unknown LLM vendor: {vendor}")
```



This could be a factory; apart from that, LGTM.
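One way the factory suggestion could look: map vendor names to builder functions instead of an if/elif chain, so adding a vendor is a one-line registry change. This is a hedged sketch, not the merged implementation; the `_build_*` functions return placeholder strings standing in for the real ChatOpenAI/ChatBedrock/ChatOllama constructors and their config handling.

```python
# Sketch of a vendor factory for get_llm_model_direct. The builders below are
# placeholders; in real code they would construct the langchain chat models
# using the vendor-specific config, as in the reviewed function.
from typing import Any, Callable, Dict

def _build_openai(model_name: str, **kwargs: Any) -> str:
    return f"ChatOpenAI({model_name})"  # placeholder for the real constructor call

def _build_aws(model_name: str, **kwargs: Any) -> str:
    return f"ChatBedrock({model_name})"

def _build_ollama(model_name: str, **kwargs: Any) -> str:
    return f"ChatOllama({model_name})"

_VENDOR_FACTORIES: Dict[str, Callable[..., str]] = {
    "openai": _build_openai,
    "aws": _build_aws,
    "ollama": _build_ollama,
}

def get_llm_model_direct(model_name: str, vendor: str, **kwargs: Any) -> str:
    try:
        factory = _VENDOR_FACTORIES[vendor]
    except KeyError:
        # same error contract as the reviewed code
        raise ValueError(f"Unknown LLM vendor: {vendor}") from None
    return factory(model_name, **kwargs)
```

The registry also keeps lazy imports possible: each builder can import its langchain package only when called, matching the in-branch imports of the original.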

@maciejmajek maciejmajek merged commit 0eec253 into development Apr 23, 2025
5 checks passed
@maciejmajek maciejmajek deleted the jm/refactor/rai_bench branch April 23, 2025 19:09
